
    Tandem approach for information fusion in audio visual speech recognition

    Speech is the medium humans most frequently prefer for interacting with their environment, making it an ideal instrument for human-computer interfaces. However, for speech recognition systems to become more prevalent in real-life applications, high recognition accuracy together with speaker independence and robustness to hostile conditions is necessary. One of the main preoccupations of speech recognition systems is acoustic noise. Audio-visual speech recognition systems aim to overcome the noise problem by utilizing visual speech information, generally extracted from the face or, in particular, the lip region. Visual speech information is known to be a complementary source for speech perception and is not affected by acoustic noise. This advantage introduces two additional issues into the task: visual feature extraction and information fusion. There is extensive research on both issues, but an admissible level of success has not yet been reached. This work concentrates on the issue of information fusion and proposes a novel methodology. The aim of the proposed technique is to deploy a preliminary decision stage at the frame level and feed a Hidden Markov Model with the output posterior probabilities, as in the tandem HMM approach. First, classification is performed for each modality separately. Subsequently, the individual classifiers of each modality are combined to obtain posterior probability vectors corresponding to each speech frame. The purpose of the preliminary stage is to integrate acoustic and visual data for maximum class separability. Hidden Markov Models are employed as the second modelling stage because of their ability to handle the temporal evolution of data. The proposed approach is investigated in a speaker-independent digit recognition scenario in the presence of varying levels of car noise. The method is compared with the Multiple Stream Hidden Markov Model (MSHMM), a principal information fusion framework in audio-visual speech recognition. Results on the M2VTS database show that the novel method achieves comparable performance to MSHMM with less processing time.
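The tandem idea described above can be sketched minimally: a first-stage classifier produces per-frame class posteriors for each modality, and the combined posterior vectors become the observation features for a second-stage HMM. The distance-to-prototype softmax below is a hypothetical stand-in for the paper's trained frame classifiers, not the authors' actual models:

```python
import numpy as np

def frame_posteriors(frames, prototypes):
    """First-stage frame classifier for one modality: softmax over negative
    distances to per-class prototypes (an illustrative stand-in for a
    trained classifier). frames: (n_frames, dim); prototypes: (n_classes, dim)."""
    d = np.linalg.norm(frames[:, None, :] - prototypes[None, :, :], axis=2)
    e = np.exp(-d)
    return e / e.sum(axis=1, keepdims=True)

def tandem_observations(audio_post, visual_post):
    """Combine per-modality posterior vectors frame by frame; the resulting
    vectors would be fed to the second-stage HMM as observations."""
    return np.concatenate([audio_post, visual_post], axis=1)
```

Each row of the combined matrix is one speech frame's fused posterior representation; any HMM toolkit that accepts continuous observation vectors could model the temporal evolution of these features.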

    Audio-visual speech recognition in vehicular noise using a multi-classifier approach

    Speech recognition accuracy can be increased and noise robustness improved by taking advantage of the visual speech information acquired from the lip region. Combining audio and visual information sources requires efficient information fusion techniques. In this paper, we propose a novel SVM-HMM tandem hybrid feature extraction and combination method for an audio-visual speech recognition system. From each stream, multiple one-versus-rest support vector machine (SVM) binary classifiers are trained, where each word is treated as a class in a limited-vocabulary speech recognition scenario. The outputs of the binary classifiers are treated as a vector of features to be combined with the vector from the other stream, and new combining binary classifiers are built. The outputs of these classifiers are used as observed features in hidden Markov models (HMMs) representing words. The whole process can be considered a nonlinear feature dimension reduction system that extracts highly discriminative features from limited amounts of training data. To simulate the performance of the system in a real-world environment, we add vehicular noise at different SNRs to the speech data and perform extensive experiments.
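A minimal sketch of the one-versus-rest stage and the stream-combination step, assuming linear SVMs trained by hinge-loss subgradient descent (a simple NumPy stand-in for whatever SVM solver the paper actually used):

```python
import numpy as np

def train_ovr(X, y, n_classes, lr=0.1, reg=1e-3, epochs=500):
    """Train one-versus-rest linear SVMs by subgradient descent on the
    hinge loss. Column c of W scores class (word) c against the rest."""
    n, d = X.shape
    W, b = np.zeros((d, n_classes)), np.zeros(n_classes)
    # Target matrix: +1 where the frame belongs to the class, -1 elsewhere.
    T = np.where(np.arange(n_classes)[None, :] == y[:, None], 1.0, -1.0)
    for _ in range(epochs):
        active = (T * (X @ W + b)) < 1.0   # margin violations
        G = -(T * active)                  # hinge subgradient w.r.t. scores
        W -= lr * (X.T @ G / n + reg * W)
        b -= lr * G.mean(axis=0)
    return W, b

def ovr_scores(X, W, b):
    """Per-class SVM output scores; one column per word class."""
    return X @ W + b

def combine_streams(audio_scores, visual_scores):
    """Stack the per-stream OvR output vectors. In the paper's pipeline,
    combining classifiers are trained on this stacked representation and
    their outputs serve as HMM observation features."""
    return np.concatenate([audio_scores, visual_scores], axis=1)
```

Training a second round of `train_ovr` on the output of `combine_streams` mirrors the "combining binary classifiers" step; the resulting score vectors act as low-dimensional, discriminative observations for the word HMMs.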

    Görsel–işitsel konuşma tanıma’da veri kaynaştırma teknikleri (Information fusion techniques in audio-visual speech recognition)

    It is well known that human perception of speech relies on both audio and visual information. However, the physiology of the information fusion process in humans is still not well understood, which draws scientists' attention to information fusion for audio-visual speech recognition. In this work, a novel tandem hybrid approach is introduced for an efficient audio-visual speech recognition system, and the performance of the proposed technique is experimentally compared with the widely used Multiple Stream Hidden Markov Model (MSHMM) approach.

    Lip segmentation using adaptive color space training

    In audio-visual speech recognition (AVSR), it is beneficial to use lip boundary information in addition to texture-dependent features. In this paper, we propose an automatic lip segmentation method that can be used in AVSR systems. The algorithm consists of the following steps: face detection, lip corner extraction, adaptive color space training for lip and non-lip regions using Gaussian mixture models (GMMs), and curve evolution using a level-set formulation based on region and image gradient fields. The region-based fields are obtained from the adapted GMM likelihoods. We have tested the proposed algorithm on a database (SU-TAV) of 100 facial images and obtained objective performance results by comparing automatic lip segmentations with hand-marked ground-truth segmentations. Experimental results are promising, but much work remains to be done to improve the robustness of the proposed method.
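The color-model step can be illustrated with a simplified sketch: fit a color distribution to labelled lip and non-lip pixels, then label each pixel by likelihood comparison. The paper uses adapted GMMs; a single full-covariance Gaussian per class is used here purely as a simplification:

```python
import numpy as np

def fit_gaussian(pixels):
    """Fit one full-covariance Gaussian to a set of color pixels
    (a single-component simplification of the paper's GMMs)."""
    mu = pixels.mean(axis=0)
    cov = np.cov(pixels.T) + 1e-6 * np.eye(pixels.shape[1])  # regularized
    return mu, cov

def log_likelihood(pixels, mu, cov):
    """Per-pixel Gaussian log-likelihood."""
    d = pixels - mu
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ij,jk,ik->i', d, inv, d)  # Mahalanobis terms
    return -0.5 * (quad + logdet + pixels.shape[1] * np.log(2 * np.pi))

def lip_mask(pixels, lip_model, bg_model):
    """Region-based field: a pixel is labelled 'lip' when it is more
    likely under the lip color model than the non-lip model."""
    return log_likelihood(pixels, *lip_model) > log_likelihood(pixels, *bg_model)
```

In the full method this likelihood ratio would drive the region term of the level-set curve evolution, with the image gradient field contributing the boundary term.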